Cox Proportional Hazards Regression

Kim Johnson

3/25/2020

Outline

  1. Overview
  2. The Cox model and key features
  3. Partial liklihood estimation
  4. Model fitting and significance testing
  5. Handling tied survival times
  6. Interpreting the estimate model using HRs
  7. Interpeting the estimated model using model-based survival curves

Cox model

Assumptions (similar to other models except #5)

  1. Independence of observations
  2. linear relationship between continous variables and outcome
  3. No influential observations
  4. Multiplicative relationship between predictors and hazard
  5. Constant hazard over time

Key features

Partial likelihood estimation

Partial likelihood demo

  1. Develop a likelihood function (which maximizes the likelihood that the model can reproduce your sample)
  2. Plug in starting values for \(\beta\)’s
  3. Use a numeric approach to test different sets of \(\beta\)’s through iteration until you get convergence where the liklihood no longer changes.

Partial likelihood estimation (only for single parameter no ties data)

-Step 1: Develop PL function - Sort data in ascending order of T (survival times) - The hazard for individual 1 is:

Partial likelihood estimation continued

Partial likelihood estimation continued

Partial likelihood estimation continued

and defines the likelihood for a censoring case (e.g. case 21, dead=0) as

Partial likelihood estimation continued

Partial likelihood estimation continued

-Step 2: Plug in starting value of \(\beta\)

-Step 3: Search the best \(\beta\) until you maximize the log PL

Partial likelihood estimation continued

Model fitting and sigificance testing

  1. The Wald test (recommended)

  2. The partial likelihood ratio test: G=2[L\(_{p}\)(M1)-L\(_{p}\)(M0)] where L\(_{p}\)(M1) is the log likelihood of the model containing the covariates being tested and L\(_{p}\)(M0) is the log likelihood of the null model. We can also use this test to compare models as we have done previously. The degrees of freedom to perform a chi-square test is the number of test covariates

  3. The score test

Handling tied survival times (cases with same event times)

Handling tied survival times continued

Hazard ratio interpretations

Hazard ratio interpretations continued

Hazard ratio interpretations continued

Model-predicted survivor curves

Model-predicted survivor curves continued

PH assumption testing

  1. Graphical
  2. goodness-of-fit tests
  3. time-dependent variables

What to do if the assumption is violated?

  1. Nothing not a big deal
  2. Stratified Cox procedure
    • splits sample into subsamples on levels of stratification variable
    • assumes different baseline hazards
    • get different survival curves for each strata but one HR
    • used for adjusting variable and not main variable of interest
  3. Report hazard ratios for varying follow-up intervals
  4. Restricted mean survival time models

Let’s do some of the modeling in R using the leukemia dataset from last week. This demo will again use the leukemia remission/death data that we used last week with the addition of sex as a variable. We will use Cox proportional hazards regression to model the hazard of leukemia relapse in the untreated vs. treated groups.

Install packages

#install.packages("survival") #for survival analysis by group
#install.packages('ggfortify') #for survival analysis by group
#install.packages("survminer") #for pairwise diffs
#install.packages("readxl") #for importing excel datasets
#install.packages("tidyverse")
#install.packages("lmtest")
#install.packages("MASS")
library(survminer)#for pairwise diffs
## Loading required package: ggplot2
## Loading required package: ggpubr
## Loading required package: magrittr
library(survival) #for calculating KM values
library(MASS)#for log log survival curves
library(ggfortify) #for KM curves
library(readxl) # for reading in excel file
library(ggplot2) # for plotting KM curve
library(tidyverse) # for various packages
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble  2.1.3     ✔ dplyr   0.8.4
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ✔ purrr   0.3.3
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::select()    masks MASS::select()
## ✖ purrr::set_names() masks magrittr::set_names()
library(lmtest) #model comparison
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(foreign)

Import dataset and do data managment needed for this exercise

leuk <- read.dta("http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/anderson.dta")# leukemia remission dataset with sex variable found on the internet. Please note in this version of the dataset some of the variable names have changed


#label the group variable
leuk$rx<-factor(leuk$rx,
      levels = c(0,1),
      labels = c("Treated", "Untreated"))

leuk$id <- rownames(leuk)

Which group has a higher hazard of relapse (treated or untreated)? Run univariate Cox model for examine the association between group and leukemia relapse. Interpret the results.

treat.mod<-coxph(Surv(survt, status)~rx, leuk, ties="efron") #using ties = Efron, default is Efron, which is fine but this is how it would be changed.
summary(treat.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx, data = leuk, ties = "efron")
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.5721    4.8169   0.4124 3.812 0.000138 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.817     0.2076     2.147     10.81
## 
## Concordance= 0.69  (se = 0.041 )
## Likelihood ratio test= 16.35  on 1 df,   p=5e-05
## Wald test            = 14.53  on 1 df,   p=1e-04
## Score (logrank) test = 17.25  on 1 df,   p=3e-05
#Interpretation: Those who were not treated had a 4.82 (95% CI 2.15-10.81) times higher hazard of relapse than those who were not treated.

treat.mod1<-coxph(Surv(survt, status)~rx, leuk, ties="breslow") #using ties = Breslow--the default for SAS
summary(treat.mod1)
## Call:
## coxph(formula = Surv(survt, status) ~ rx, data = leuk, ties = "breslow")
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.5092    4.5231   0.4096 3.685 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.523     0.2211     2.027     10.09
## 
## Concordance= 0.69  (se = 0.041 )
## Likelihood ratio test= 15.21  on 1 df,   p=1e-04
## Wald test            = 13.58  on 1 df,   p=2e-04
## Score (logrank) test = 15.93  on 1 df,   p=7e-05

Adjust the Cox model for logwbc and interpret the results.

treat_adj.mod<-coxph(Surv(survt, status)~rx + logwbc, leuk)
summary(treat_adj.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc, data = leuk)
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.3861    3.9991   0.4248 3.263   0.0011 ** 
## logwbc      1.6909    5.4243   0.3359 5.034  4.8e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     3.999     0.2501     1.739     9.195
## logwbc          5.424     0.1844     2.808    10.478
## 
## Concordance= 0.852  (se = 0.04 )
## Likelihood ratio test= 46.71  on 2 df,   p=7e-11
## Wald test            = 33.6  on 2 df,   p=5e-08
## Score (logrank) test = 46.07  on 2 df,   p=1e-10
#Interpretation: After adjusting for logwbc, those who were not treated had a 4.0 (95% CI 1.74-9.20) times higher hazard of leukemia relapse than those who were treated.

Include an interaction term in the model between Group and log_WBC to see if there is effect modification of the hazard of leukemia relapse in those who were not treated vs. treated according to their log_WBC. Interpret the results.

treat_int.mod<-coxph(Surv(survt, status)~rx  + logwbc + logwbc*rx, leuk)
summary(treat_int.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + logwbc * 
##     rx, data = leuk)
## 
##   n= 42, number of events= 30 
## 
##                       coef exp(coef) se(coef)      z Pr(>|z|)    
## rxUntreated         2.3749   10.7500   1.7055  1.393    0.164    
## logwbc              1.8724    6.5040   0.4514  4.148 3.35e-05 ***
## rxUntreated:logwbc -0.3175    0.7280   0.5258 -0.604    0.546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                    exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated           10.750    0.09302    0.3799    304.16
## logwbc                 6.504    0.15375    2.6850     15.75
## rxUntreated:logwbc     0.728    1.37372    0.2597      2.04
## 
## Concordance= 0.851  (se = 0.041 )
## Likelihood ratio test= 47.07  on 3 df,   p=3e-10
## Wald test            = 32.39  on 3 df,   p=4e-07
## Score (logrank) test = 49.86  on 3 df,   p=9e-11
#Interpretation: There is no significant effect modification (p for interaction=0.546) of the HR for the association between treatment and leukemia relapse by logwbc.

Compare models using the likelihood ratio test (a measure of model fit) and interpret the findings.

lrtest(treat.mod, treat_adj.mod)
## Likelihood ratio test
## 
## Model 1: Surv(survt, status) ~ rx
## Model 2: Surv(survt, status) ~ rx + logwbc
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -85.008                         
## 2   2 -69.828  1 30.361  3.587e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The likelihood ratio test indicates that the model that includes log_WBC significantly improves fit compared to the model with just treatment (p=3 x 10-10). 

Plot survival curves adjusted for mean log_WBC

# from http://www.sthda.com/english/wiki/cox-proportional-hazards-model 
# Create the new data for plotting adjusted survival curves for each treatment group using log_WBC set at the mean
trt_df <- with(leuk, data.frame(rx = c("Untreated", "Treated"), logwbc=rep(mean(logwbc, na.rm = TRUE),2)))
trt_df
##          rx   logwbc
## 1 Untreated 2.930238
## 2   Treated 2.930238
#problem with survminer ggsurvplot function that won't allow it to take model objects solved with code below. see: https://github.com/kassambara/survminer/issues/324
fit<-survfit(treat_adj.mod, newdata = trt_df)
fit$call$formula <- eval(fit$call$formula)

ggsurvplot(fit, data=leuk, conf.int = TRUE, legend.labs=c("Untreated", "Treated"),  ggtheme = theme_minimal()) 

#change X-axis limits for fun

ggsurvplot(fit, data=leuk, conf.int = FALSE, legend.labs=c("Untreated", "Treated"), xlim = c(0, 35), ylim= c(0,1), ggtheme = theme_minimal()) 

#We can see from these curves after adjusting for log_WBC, that at almost all time points there is a higher survival probability in the treated group than in the untreated group.

#what about unadjusted KM curves--how do they compare?
leukemia.surv <- survfit(Surv(survt, status)~rx, leuk)

ggsurvplot(leukemia.surv, data = leuk, conf.int=FALSE, ggtheme = theme_minimal(), tables.theme = clean_theme())

We can also plot log_WBC adjusted survival curves by sex

# Need to rerun the cox model with sex in it:
sex_treat_adj.mod<-coxph(Surv(survt, status)~rx + logwbc + sex, leuk)
summary(sex_treat_adj.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex, data = leuk)
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.5036    4.4978   0.4615 3.258  0.00112 ** 
## logwbc      1.6819    5.3760   0.3366 4.997 5.82e-07 ***
## sex         0.3147    1.3698   0.4545 0.692  0.48872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.498     0.2223    1.8204    11.113
## logwbc          5.376     0.1860    2.7794    10.398
## sex             1.370     0.7300    0.5621     3.338
## 
## Concordance= 0.851  (se = 0.041 )
## Likelihood ratio test= 47.19  on 3 df,   p=3e-10
## Wald test            = 33.54  on 3 df,   p=2e-07
## Score (logrank) test = 48.01  on 3 df,   p=2e-10
# Create the new data for plotting adjusted survival curves for each treatment group using log_WBC set at the mean  
sex_trt_df <- with(leuk, data.frame(sex=c(0,1), rx=1, logwbc=rep(mean(logwbc, na.rm = TRUE),2))) #"The curve(s) produced will be representative of a cohort whose covariates correspond to the values in newdata" (https://stat.ethz.ch/R-manual/R-patched/library/survival/html/survfit.coxph.html)
sex_trt_df
##   sex rx   logwbc
## 1   0  1 2.930238
## 2   1  1 2.930238
#here is the code again to fix the error noted above with ggsurvplot
fit2<-survfit(sex_treat_adj.mod, newdata = sex_trt_df)
## Warning in model.frame.default(data = structure(list(sex = c(0, 1), rx = c(1, :
## variable 'rx' is not a factor
fit2$call$formula <- eval(fit2$call$formula)

ggsurvplot(fit2, data=leuk, conf.int = FALSE, legend.labs=c("Sex=0", "Sex=1"),  ggtheme = theme_minimal()) 

#compare to unadjusted KM curves for sex
leukemia.surv <- survfit(Surv(survt, status) ~ sex, leuk)

#Using survminer library to calculate KM plots with confidence intervals
ggsurvplot(leukemia.surv, data = leuk,  ggtheme = theme_minimal(), tables.theme = clean_theme())

Besides the PH assumption that we will test next week, there are other diagnostic tests you can do to check for outliers and non-linearity as we did in logistic regression. I show some code for that here for your reference. See: https://www.r-bloggers.com/cox-model-assumptions/ for further details.

#check for linearity of the log_WBC term
log_wbc.times.lwbc3<- leuk$lwbc3 * log(leuk$lwbc3)#create term to test linearity

boxTidwelllwbc3 <- coxph(Surv(survt, status)~rx + logwbc + sex + log_wbc.times.lwbc3, leuk)
summary(sex_treat_adj.mod) #Box Tidwell technique, test the assumption of linearity
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex, data = leuk)
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.5036    4.4978   0.4615 3.258  0.00112 ** 
## logwbc      1.6819    5.3760   0.3366 4.997 5.82e-07 ***
## sex         0.3147    1.3698   0.4545 0.692  0.48872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.498     0.2223    1.8204    11.113
## logwbc          5.376     0.1860    2.7794    10.398
## sex             1.370     0.7300    0.5621     3.338
## 
## Concordance= 0.851  (se = 0.041 )
## Likelihood ratio test= 47.19  on 3 df,   p=3e-10
## Wald test            = 33.54  on 3 df,   p=2e-07
## Score (logrank) test = 48.01  on 3 df,   p=2e-10
summary(boxTidwelllwbc3)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex + log_wbc.times.lwbc3, 
##     data = leuk)
## 
##   n= 42, number of events= 30 
## 
##                         coef exp(coef) se(coef)      z Pr(>|z|)    
## rxUntreated          1.51094   4.53101  0.46119  3.276  0.00105 ** 
## logwbc               1.77287   5.88773  0.53783  3.296  0.00098 ***
## sex                  0.31915   1.37595  0.45557  0.701  0.48359    
## log_wbc.times.lwbc3 -0.06612   0.93602  0.30445 -0.217  0.82807    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated             4.531     0.2207    1.8350     11.19
## logwbc                  5.888     0.1698    2.0518     16.89
## sex                     1.376     0.7268    0.5634      3.36
## log_wbc.times.lwbc3     0.936     1.0684    0.5154      1.70
## 
## Concordance= 0.85  (se = 0.04 )
## Likelihood ratio test= 47.23  on 4 df,   p=1e-09
## Wald test            = 33.73  on 4 df,   p=8e-07
## Score (logrank) test = 48.68  on 4 df,   p=7e-10
#check for influential observations

sex_treat_adj.mod<-coxph(Surv(survt, status)~rx + logwbc + sex, leuk)
summary(sex_treat_adj.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex, data = leuk)
## 
##   n= 42, number of events= 30 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.5036    4.4978   0.4615 3.258  0.00112 ** 
## logwbc      1.6819    5.3760   0.3366 4.997 5.82e-07 ***
## sex         0.3147    1.3698   0.4545 0.692  0.48872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.498     0.2223    1.8204    11.113
## logwbc          5.376     0.1860    2.7794    10.398
## sex             1.370     0.7300    0.5621     3.338
## 
## Concordance= 0.851  (se = 0.041 )
## Likelihood ratio test= 47.19  on 3 df,   p=3e-10
## Wald test            = 33.54  on 3 df,   p=2e-07
## Score (logrank) test = 48.01  on 3 df,   p=2e-10
ggcoxdiagnostics(sex_treat_adj.mod, type = "dfbeta", sline=FALSE, ggtheme = theme_bw()) #the pattern should not change the beta by a large degree. Here the betas for logwbc change by a factor of less than +/-0.2 for all covariates 

#Compare to model without influential id, pick 22, the HR should go up for sex because the dfbeta residual is negative
sex_treat_adj.mod<-coxph(Surv(survt, status)~rx + logwbc + sex, leuk[which(leuk$id!=22),])
summary(sex_treat_adj.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex, data = leuk[which(leuk$id != 
##     22), ])
## 
##   n= 41, number of events= 29 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.6068    4.9868   0.5136 3.129  0.00176 ** 
## logwbc      1.5646    4.7807   0.3489 4.484 7.32e-06 ***
## sex         0.4956    1.6415   0.4901 1.011  0.31191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.987     0.2005    1.8224    13.645
## logwbc          4.781     0.2092    2.4127     9.473
## sex             1.642     0.6092    0.6281     4.290
## 
## Concordance= 0.849  (se = 0.041 )
## Likelihood ratio test= 45.46  on 3 df,   p=7e-10
## Wald test            = 32.55  on 3 df,   p=4e-07
## Score (logrank) test = 47.25  on 3 df,   p=3e-10
#Compare to model without influential id, pick 38, the HR should go down for sex because the dfbeta residual is positive
sex_treat_adj.mod<-coxph(Surv(survt, status)~rx + logwbc + sex, leuk[which(leuk$id!=37),])
summary(sex_treat_adj.mod)
## Call:
## coxph(formula = Surv(survt, status) ~ rx + logwbc + sex, data = leuk[which(leuk$id != 
##     37), ])
## 
##   n= 41, number of events= 29 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## rxUntreated 1.4268    4.1652   0.4712 3.028  0.00246 ** 
## logwbc      1.8252    6.2038   0.3552 5.138 2.78e-07 ***
## sex         0.1747    1.1909   0.4744 0.368  0.71269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## rxUntreated     4.165     0.2401     1.654    10.489
## logwbc          6.204     0.1612     3.092    12.446
## sex             1.191     0.8397     0.470     3.017
## 
## Concordance= 0.865  (se = 0.04 )
## Likelihood ratio test= 48.45  on 3 df,   p=2e-10
## Wald test            = 33.22  on 3 df,   p=3e-07
## Score (logrank) test = 48.54  on 3 df,   p=2e-10